Max Piazza
David Orona
Nithya Arumugam
Abhitej Bokka

Introduction

In an ever-volatile market where every dollar counts, the used car market represents a critical sector of the consumer industry. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by accurately pricing their vehicles. Bridging this gap requires a data-driven approach to uncover the relationships between vehicle specifications and market pricing.

This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle at https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho/data?select=Car+details+v3.csv.

This dataset aggregates detailed information on over 10,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type, ownership history, and geographic location.

Our analysis is driven by three core objectives:

Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.

By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.

The dataset contains the following attributes:

Attribute Name Description
1 name The make and model of the vehicle (e.g., Hyundai i10, Honda City).
2 year The year the vehicle was manufactured.
3 selling_price The selling price of the vehicle in Indian Rupees (INR).
4 km_driven The total distance the vehicle has been driven (in km).
5 fuel The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”).
6 seller_type The type of seller (“Individual”, “Dealer” or “Trustmark Dealer”).
7 transmission The type of transmission system (“Manual” or “Automatic”).
8 owner The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”).
9 mileage The fuel efficiency of the vehicle (in either km/l or km/kg).
10 engine The engine displacement capacity (in CC).
11 max_power The maximum power output of the vehicle’s engine, measured in brake horsepower (bhp).
12 torque A pair of torque and RPM values representing the max torque of the vehicle. The torque values are in either Nm or kgm, and the RPM values are either values or value ranges.
13 seats The seating capacity of the vehicle as an integer.

The above attributes can be categorized into numeric and categorical variables. Numeric variables can be separated into discrete and continuous variables.

Attribute Type Attribute Name
Categorical Variables name, fuel, seller_type, transmission, owner
Discrete Numeric Variables year, km_driven, seats
Continuous Numeric Variables selling_price, mileage, engine, max_power, torque

Methods

# Set document-wide options 
options(tibble.width = Inf)  # Make tibbles display all columns

Load dataset

car_details = read_csv("Car details v3.csv")

Total number of observations

nrow(car_details)
## [1] 8128

Sample of data from Vehicle dataset

head(car_details)
## # A tibble: 6 × 13
##   name                           year selling_price km_driven fuel   seller_type
##   <chr>                         <dbl>         <dbl>     <dbl> <chr>  <chr>      
## 1 Maruti Swift Dzire VDI         2014        450000    145500 Diesel Individual 
## 2 Skoda Rapid 1.5 TDI Ambition   2014        370000    120000 Diesel Individual 
## 3 Honda City 2017-2020 EXi       2006        158000    140000 Petrol Individual 
## 4 Hyundai i20 Sportz Diesel      2010        225000    127000 Diesel Individual 
## 5 Maruti Swift VXI BSIII         2007        130000    120000 Petrol Individual 
## 6 Hyundai Xcent 1.2 VTVT E Plus  2017        440000     45000 Petrol Individual 
##   transmission owner        mileage    engine  max_power 
##   <chr>        <chr>        <chr>      <chr>   <chr>     
## 1 Manual       First Owner  23.4 kmpl  1248 CC 74 bhp    
## 2 Manual       Second Owner 21.14 kmpl 1498 CC 103.52 bhp
## 3 Manual       Third Owner  17.7 kmpl  1497 CC 78 bhp    
## 4 Manual       First Owner  23.0 kmpl  1396 CC 90 bhp    
## 5 Manual       First Owner  16.1 kmpl  1298 CC 88.2 bhp  
## 6 Manual       First Owner  20.14 kmpl 1197 CC 81.86 bhp 
##   torque                   seats
##   <chr>                    <dbl>
## 1 190Nm@ 2000rpm               5
## 2 250Nm@ 1500-2500rpm          5
## 3 12.7@ 2,700(kgm@ rpm)        5
## 4 22.4 kgm at 1750-2750rpm     5
## 5 11.5@ 4,500(kgm@ rpm)        5
## 6 113.75nm@ 4000rpm            5

Data cleaning

Excluding missing values from dataset

car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906

Excluding duplicate rows from dataset

sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717

Transforming name variable

# Extract the first word from "name" to get the make of the vehicle
car_details$name = word(car_details$name,1)
# Rename "name" variable to "make"
colnames(car_details)[1] = "make"
# Change datatypes of "make" from character to factor
car_details$make = as.factor(car_details$make)

Transforming mileage variable

# View raw data for "mileage" variable
head(car_details$mileage, n=20)
##  [1] "23.4 kmpl"  "21.14 kmpl" "17.7 kmpl"  "23.0 kmpl"  "16.1 kmpl" 
##  [6] "20.14 kmpl" "17.3 km/kg" "16.1 kmpl"  "23.59 kmpl" "20.0 kmpl" 
## [11] "19.01 kmpl" "17.3 kmpl"  "19.3 kmpl"  "18.9 kmpl"  "18.15 kmpl"
## [16] "24.52 kmpl" "23.0 kmpl"  "19.7 kmpl"  "22.54 kmpl" "21.0 kmpl"
# Drop 86 rows where "mileage" value contains the text "km/kg"
car_details <- car_details[!grepl("km/kg", car_details$mileage), ]
# Extract the numeric value from "mileage" 
car_details$mileage = word(car_details$mileage,1)
# Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)

# Rename "mileage" to "fuel_efficiency"
colnames(car_details)[which(names(car_details) == "mileage")] = "fuel_efficiency"

# View new data for "fuel_efficiency" variable
head(car_details$fuel_efficiency, n=20)
##  [1] 23.40 21.14 17.70 23.00 16.10 20.14 16.10 23.59 20.00 19.01 17.30 19.30
## [13] 18.90 18.15 24.52 23.00 19.70 22.54 21.00 25.50

Above, the mileage variable is renamed to fuel_efficiency to prevent confusion with the km_driven variable, which represents the quantity normally referred to as vehicle “mileage”.

Transforming engine variable

# View raw data for "engine" variable
head(car_details$engine, n=20)
##  [1] "1248 CC" "1498 CC" "1497 CC" "1396 CC" "1298 CC" "1197 CC" "796 CC" 
##  [8] "1364 CC" "1399 CC" "1461 CC" "993 CC"  "1248 CC" "1061 CC" "1198 CC"
## [15] "1248 CC" "1396 CC" "796 CC"  "1396 CC" "1461 CC" "1498 CC"
# Extract the numeric value from each entry (remove the text "CC" from the end of each entry)
car_details$engine = word(car_details$engine,1)
# Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)
# View new data for "engine" variable
head(car_details$engine, n=20)
##  [1] 1248 1498 1497 1396 1298 1197  796 1364 1399 1461  993 1248 1061 1198 1248
## [16] 1396  796 1396 1461 1498

Transforming max_power variable

# Transform "max_power" variable
# View raw data for "max_power" variable
head(car_details$max_power, n=20)
##  [1] "74 bhp"     "103.52 bhp" "78 bhp"     "90 bhp"     "88.2 bhp"  
##  [6] "81.86 bhp"  "37 bhp"     "67.1 bhp"   "68.1 bhp"   "108.45 bhp"
## [11] "60 bhp"     "73.9 bhp"   "67 bhp"     "82 bhp"     "88.5 bhp"  
## [16] "90 bhp"     "46.3 bhp"   "88.73 bhp"  "64.1 bhp"   "98.6 bhp"
# Extract the numeric value from each entry of "max_power" (remove the text "bhp" from the end of each entry)
car_details$max_power = word(car_details$max_power,1)
# Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)
# View new data for "max_power" variable
head(car_details$max_power, n=20)
##  [1]  74.00 103.52  78.00  90.00  88.20  81.86  37.00  67.10  68.10 108.45
## [11]  60.00  73.90  67.00  82.00  88.50  90.00  46.30  88.73  64.10  98.60

Converting character-typed columns to factor type

# Change datatypes of "fuel", "seller_type", "transmission", and "owner"
# from character to factor 
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)

Converting km_driven to km_driven_in_10k

# Modify the "km_driven" column by dividing the values of each entry by 10,000 and renaming the column to "km_driven_in_10k" 
car_details$km_driven = car_details$km_driven/10000
# Rename km_driven
colnames(car_details)[4]="km_driven_in_10k"
# View data for "km_driven_in_10k" column
head(car_details)[4]
## # A tibble: 6 × 1
##   km_driven_in_10k
##              <dbl>
## 1             14.6
## 2             12  
## 3             14  
## 4             12.7
## 5             12  
## 6              4.5

Converting selling_price to selling_price_in_10k

# Modify "selling_price" column by dividing the values of each entry by 10,000 and renaming the column to "selling_price_in_10k"
car_details$selling_price = car_details$selling_price/10000
# Rename selling_price to "selling_price_in_10k"
colnames(car_details)[3]="selling_price_in_10k"
# View data for "selling_price_in_10k" column
head(car_details)[3]
## # A tibble: 6 × 1
##   selling_price_in_10k
##                  <dbl>
## 1                 45  
## 2                 37  
## 3                 15.8
## 4                 22.5
## 5                 13  
## 6                 44

Dropping torque column

# Drop "torque" column
car_details = car_details[,!names(car_details) %in% "torque"]

Creating make_category variable from make variable

unique(car_details$make)
##  [1] Maruti        Skoda         Honda         Hyundai       Toyota       
##  [6] Ford          Renault       Mahindra      Tata          Chevrolet    
## [11] Datsun        Jeep          Mercedes-Benz Mitsubishi    Audi         
## [16] Volkswagen    BMW           Nissan        Lexus         Jaguar       
## [21] Land          MG            Volvo         Daewoo        Kia          
## [26] Fiat          Force         Ambassador    Ashok         Isuzu        
## [31] Opel         
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo
  • From the above result we can see that there are 31 unique values for “make”.
  • When we model the data using the independent variable “make”, there will be at least 30 dummy variables as predictors.
  • To reduce the model complexity and to increase interpretability, the car make can be grouped into broader categories as “Budget”, “Mid-Range” or “Luxury” depending on general market perception.
# Create new column "make_category" from "make" column
car_details$make_category = ifelse(car_details$make %in% 
  c("Ambassador", "Ashok", "Daewoo", "Datsun", "Opel", "Fiat"), 
  "Budget",
  ifelse(car_details$make %in% 
           c("Chevrolet", "Maruti", "Renault", "Mitsubishi", "Ford", 
             "Honda", "Hyundai", "Isuzu", "Kia", "Toyota", "Force", 
             "Volkswagen", "Tata", "Skoda", "Jeep", "MG", "Nissan", "Mahindra"), 
         "Midrange", 
         "Luxury"
  )
)

# Convert make_category to a factor
car_details$make_category = as.factor(car_details$make_category)

# Output all levels of the make_category variable
levels(car_details$make_category)
## [1] "Budget"   "Luxury"   "Midrange"
# Create a summary table of makes and their categories and view the mapping
make_category_mapping <- unique(car_details[, c("make", "make_category")])
make_category_mapping_df <- as.data.frame(make_category_mapping)
print(make_category_mapping_df, row.names = FALSE)  
##           make make_category
##         Maruti      Midrange
##          Skoda      Midrange
##          Honda      Midrange
##        Hyundai      Midrange
##         Toyota      Midrange
##           Ford      Midrange
##        Renault      Midrange
##       Mahindra      Midrange
##           Tata      Midrange
##      Chevrolet      Midrange
##         Datsun        Budget
##           Jeep      Midrange
##  Mercedes-Benz        Luxury
##     Mitsubishi      Midrange
##           Audi        Luxury
##     Volkswagen      Midrange
##            BMW        Luxury
##         Nissan      Midrange
##          Lexus        Luxury
##         Jaguar        Luxury
##           Land        Luxury
##             MG      Midrange
##          Volvo        Luxury
##         Daewoo        Budget
##            Kia      Midrange
##           Fiat        Budget
##          Force      Midrange
##     Ambassador        Budget
##          Ashok        Budget
##          Isuzu      Midrange
##           Opel        Budget

Final structure of the car_details dataset

Attribute Name Description Variable Type
make The manufacturer of the vehicle (e.g., Maruti, Hyundai, Honda, etc.). Factor
year The year the vehicle was manufactured. Numeric
selling_price_in_10k The selling price of the vehicle in ten-thousands of Indian Rupees (INR). Numeric
km_driven_in_10k The total distance the vehicle has been driven, measured in ten-thousands of kilometers. Numeric
fuel The type of fuel used by the vehicle (“Diesel”, “Petrol”, “LPG”, or “CNG”). Factor
seller_type The type of seller (“Individual”, “Dealer”, or “Trustmark Dealer”). Factor
transmission The type of transmission system (“Manual” or “Automatic”). Factor
owner The number of previous owners of the vehicle (“First Owner”, “Second Owner”, “Third Owner”, “Fourth and Above Owner”, or “Test Drive Car”). Factor
fuel_efficiency The fuel efficiency of the vehicle in km/l. Numeric
engine The engine displacement capacity of the vehicle, measured in cubic centimeters (CC). Numeric
max_power The maximum power output of the vehicle’s engine, measured in brake horsepower (BHP). Numeric
seats The seating capacity of the vehicle. Numeric
make_category Categorized vehicle make (e.g., “Luxury”, “Midrange”, or “Budget”). Factor

Sample of final dataset

head(car_details, n=8)
## # A tibble: 8 × 13
##   make     year selling_price_in_10k km_driven_in_10k fuel   seller_type
##   <fct>   <dbl>                <dbl>            <dbl> <fct>  <fct>      
## 1 Maruti   2014                 45               14.6 Diesel Individual 
## 2 Skoda    2014                 37               12   Diesel Individual 
## 3 Honda    2006                 15.8             14   Petrol Individual 
## 4 Hyundai  2010                 22.5             12.7 Diesel Individual 
## 5 Maruti   2007                 13               12   Petrol Individual 
## 6 Hyundai  2017                 44                4.5 Petrol Individual 
## 7 Maruti   2001                  4.5              0.5 Petrol Individual 
## 8 Toyota   2011                 35                9   Diesel Individual 
##   transmission owner        fuel_efficiency engine max_power seats make_category
##   <fct>        <fct>                  <dbl>  <dbl>     <dbl> <dbl> <fct>        
## 1 Manual       First Owner             23.4   1248      74       5 Midrange     
## 2 Manual       Second Owner            21.1   1498     104.      5 Midrange     
## 3 Manual       Third Owner             17.7   1497      78       5 Midrange     
## 4 Manual       First Owner             23     1396      90       5 Midrange     
## 5 Manual       First Owner             16.1   1298      88.2     5 Midrange     
## 6 Manual       First Owner             20.1   1197      81.9     5 Midrange     
## 7 Manual       Second Owner            16.1    796      37       4 Midrange     
## 8 Manual       First Owner             23.6   1364      67.1     5 Midrange

Data Analysis

Variable Distribution Analysis

car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details=subset(car_details,subset = car_details$selling_price_in_10k < 720,)
  • There is one observation, which is quite different from the general pattern of selling price. This can impact the model that we build, hence excluded one observation with selling_price = 1000.
Selling Price distribution
# Histogram of selling prices 
hist(car_details$selling_price_in_10k,xlab="Selling Price (in 10,000 INR)",
     main="Selling Price distribution",
     breaks = 30, 
     col = "lightblue")

  • The above plot is positively skewed, meaning the selling prices for most of the observations are less than or equal to 2 million INR, and there are much fewer observations with selling prices above this amount.
Frequency Distribution of Categorical Variables
make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)

par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
        xlab = "Number of Cars",
        ylab = "Car Make",
        col = "lightblue",
        cex.names = 0.5,
        main = "Num. of Cars in Each Make")

# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Fuel Type")

# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Seller Type")

# Plot 4: Number of cars by transmission type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Trans Type")

# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
        xlab = "Number of Cars",col = "lightblue", main = "Num. of Cars in Each Owner Type")

Boxplots of Selling Price by Different Categorical Variables
Selling Price by Car Make
# Boxplot of selling price by car make
boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
        las = 2,cex.axis = 0.7,               
        main = "Boxplot of Selling Price by Car Make",
        xlab = "Car Make",
        ylab = "Selling Price (10,000 INR)")

  • The above boxplots indicates a significant variation in selling prices across different car makes. Some brands have much wider price ranges than others.
  • Brands like Mercedes-Benz, BMW, Jaguar, Land Rover and Volvo have the highest median selling prices, while cars like Tata, Maruti, and Daewoo have the lowest median selling prices.
Selling Price by Fuel Type
# Boxplot of selling price by fuel type
boxplot(selling_price_in_10k ~ fuel, data = car_details, cex.axis = 0.8,
        col = rainbow(length(unique(car_details$fuel))),
        main = "Box Plot of Selling Price by Fuel Type",
        xlab = "Fuel Type", ylab = "Selling Price (10,000 INR)")

  • From the above boxplot we can see that the median selling price of Diesel cars are slightly higher than the median selling price of Petrol cars.
Selling Price by Seller Type
# Boxplot of selling price by seller type
boxplot(selling_price_in_10k ~ seller_type, data = car_details, cex.axis = 0.6,
        col = rainbow(length(unique(car_details$seller_type))),
        main = "Box Plot of Selling Price by Seller Type",
        xlab = "Seller Type", ylab = "Selling Price (10,000 INR)")

  • The cars sold by Individuals have the lowest median cost when compared to cars sold by dealers.
Selling Price by Transmission Type
# Boxplot of selling price by transmission type
plot(selling_price_in_10k ~ transmission, data = car_details,
     col = rainbow(length(unique(car_details$transmission))),
     main = "Boxplot of Selling Price by Transmission Type",
     xlab = "Transmission Type", 
     ylab = "Selling Price (10,000 INR)")

  • We can see that there is a difference in the selling price of the Automatic and Manual transmission cars. The cost range of Automatic transmission cars is higher than that of Manual transmission cars.
Selling Price by Owner Type
# Boxplot of selling price by owner type
plot(selling_price_in_10k ~ owner, data = car_details, las = 2, cex.axis = 0.5,
     col = rainbow(length(unique(car_details$owner))),
     main = "Boxplot of Selling Price vs Owner Type",
     xlab = "Owner Type", 
     ylab = "Selling Price (10,000 INR)")

  • The selling prices of cars across different owner types are significantly different.
  • The median price of test drive cars are very high and the rest of the owner types have low median cost
  • The median selling price is in the decreasing order of First Owner, Second Owner, Third Owner, Fourth & above Owner.

Scatter Plots of Selling Price vs Different Numerical Variables

Selling Price vs Year and Transmission Type
# Scatter plot of selling price vs year and transmission type
colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
     main = "Selling Price vs Year and Transmission Type",
     xlab = "Year",
     ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Automatic", "Manual"),
       col = c("blue", "red"), pch = 19)

  • We can see a positive correlation between year and selling price. As the year increases, the selling price is increasing.
  • This suggests that newer cars are priced higher than older ones.
  • We can also see that automatic transmission cars have a higher selling price across all years.
Selling Price vs Year and Make Category
# Scatter plot of selling price vs year and make category
colours = c("Budget" = "blue", "Midrange"="green","Luxury"= "orange")
plot(selling_price_in_10k ~ year,data=car_details,
     col = colours[car_details$make_category],pch=19,
     main = "Selling Price vs Year and Make Category",
     xlab = "Year",
     ylab = "Selling Price (10,000 INR)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
       col = c("blue", "green","orange"), pch = 19)

  • Similar to previous plot, there is a positive correlation between year and selling price.
  • We see that cost of Budget cars, mid range cars and luxury car increases with Year.
Selling Price vs Km Driven
# Scatter plot of selling price vs km driven
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="Km Driven (10,000 km)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Km Driven")

# Subset out extreme values
car_details = subset(car_details,car_details$km_driven_in_10k < 100,)

# Scatter plot of selling price vs km driven after removing extreme values
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="Km Driven (10,000 km)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Km Driven (extreme values removed)")

  • The relationship between selling price and kilometers driven doesn’t seem to be strongly linear.
  • But we can see that as km driven increases, the selling price remains in low range.
  • There are 2 observations which are different from the general pattern with values of km driven (150.0000 236.0457). This can been seen in the above plot.
  • These observations can impact the model. Hence those two data points are excluded from the second plot.
Selling Price vs Fuel Efficiency
# Subset out 15 observations with fuel_efficiency=0 
car_details = subset(car_details, car_details$fuel_efficiency!=0,)

# Scatter plot of selling price vs fuel efficiency
plot(selling_price_in_10k ~ fuel_efficiency, data=car_details,
     xlab="Fuel Efficiency (km/l)",
     ylab="Selling Price (10,000 INR)",
     main="Selling Price vs Fuel Efficiency")

  • Most data points are clustered at fuel efficiency values of 10 to 30 km/l, and there doesn’t seem to be a linear relationship.
  • That is higher fuel efficiency doesn’t indicate higher selling price.
Selling Price vs Engine Displacement

  • From the above plot, we can see that the selling price is high for higher values of engine power
Selling Price vs Max Power
# Scatter plot of selling price vs max power (by fuel type)
colours = c("Petrol" = "blue", "Diesel"="green")
plot(selling_price_in_10k ~ max_power, data=car_details,
     col=colours[car_details$fuel],
     main="Selling Price vs Max Power (by Fuel Type)",
     xlab="Max Power (bhp)",
     ylab="Selling Price (10,000 INR)" )
legend("topleft", legend = c("Petrol","Diesel"),
       col = c("blue", "green"), pch = 19)

  • From the above plot, we can see that the there is a linear relationship between max power and selling price.
  • We can conclude that when the maximum power increases, the selling price of the car increases
  • In general, we see that diesel-powered cars have higher selling prices.
    • But in the above plot, we see that petrol-powered cars with higher horsepower have higher selling prices.
Conclusions of Variable Distribution Analysis
  • The selling price distribution is positively skewed. Positively skewed data has extreme values which makes it hard to fit models.
    • Hence, logarithmic transformation can make the selling price distribution to be normally distributed.
    • Logarithmic transformations can also help stabilize the variance, making the data more homoscedastic and suitable for analysis.
  • There is a positive correlation between year and selling price.
  • The selling price also tends to increase with engine displacement and max power.
  • Prices of budget cars, midrange cars and luxury car increase with Year.
  • As the km driven increases, the selling price tends to decrease.
  • There is no impact of fuel efficiency on selling price.
  • The median price of automatic transmission cars is higher than that of manual transmission cars.
  • The median price of test drive cars is very high when compared to other owner types.
  • Prices of diesel cars are generally high. But prices of petrol cars with high horsepower are also high.

Analysis of Correlation Between Numeric Variables

# Pairs plot for numeric variables
pairs(selling_price_in_10k ~ year + km_driven_in_10k + fuel_efficiency + engine + max_power + seats, data = car_details)

# Calculate correlation matrix for numeric variables
cor_mat = cor(car_details[, sapply(car_details, is.numeric)])
cor_mat
##                             year selling_price_in_10k km_driven_in_10k
## year                  1.00000000            0.4395767      -0.45496390
## selling_price_in_10k  0.43957666            1.0000000      -0.19812586
## km_driven_in_10k     -0.45496390           -0.1981259       1.00000000
## fuel_efficiency       0.37352047           -0.1258892      -0.24186898
## engine               -0.02333154            0.4480476       0.30860425
## max_power             0.15993052            0.6857687       0.04886655
## seats                 0.02630866            0.1648269       0.24955889
##                      fuel_efficiency      engine   max_power       seats
## year                       0.3735205 -0.02333154  0.15993052  0.02630866
## selling_price_in_10k      -0.1258892  0.44804759  0.68576871  0.16482693
## km_driven_in_10k          -0.2418690  0.30860425  0.04886655  0.24955889
## fuel_efficiency            1.0000000 -0.59909460 -0.40078564 -0.48552947
## engine                    -0.5990946  1.00000000  0.68399138  0.66312580
## max_power                 -0.4007856  0.68399138  1.00000000  0.25911446
## seats                     -0.4855295  0.66312580  0.25911446  1.00000000
# Extract high correlation values from the upper triangle of the correlation 
# matrix (to remove redundant correlations caused by symmetry)
high_cor_indices = which(upper.tri(cor_mat) & cor_mat > 0.5, arr.ind = TRUE)

# Create table for high-correlation variables
high_cor_df = data.frame(
  row = rownames(cor_mat)[high_cor_indices[, 1]],
  column = colnames(cor_mat)[high_cor_indices[, 2]],
  correlation = cor_mat[high_cor_indices]
)

# Display the first few entries in the table
head(high_cor_df)
##                    row    column correlation
## 1 selling_price_in_10k max_power   0.6857687
## 2               engine max_power   0.6839914
## 3               engine     seats   0.6631258
  • From the above table of high-correlation variable pairs, we can see that the following variable pairs have correlations above 0.5:
    • max_power and selling_price have a high correlation of 0.6872307. Hence, selling price increases with max power.
    • max_power and engine have a high correlation of 0.6863027. This indicates that higher-displacement engines tend to be more powerful.
    • seats and engine have high a correlation of 0.6631. This indicates that vehicles with more seats tend to have larger engines.

Model Development and Validation

In this section, a model for predicting values of selling_price based on the values of the other variables is developed and the performance of the model is analyzed.

Splitting data into train and test sets

set.seed(125) # For reproducibility
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]
nrow(train_data)
## [1] 5289
nrow(test_data)
## [1] 1323

Additive “full” linear model (using all available predictors except make)

# Definition of "full" linear model
full_model = lm(selling_price_in_10k ~ .-make, data = train_data)

# Model summary
summary(full_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ . - make, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -207.237  -11.048   -1.612    8.107  311.373 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -8.288e+03  2.832e+02 -29.269  < 2e-16 ***
## year                         4.129e+00  1.419e-01  29.098  < 2e-16 ***
## km_driven_in_10k            -8.828e-01  9.577e-02  -9.218  < 2e-16 ***
## fuelPetrol                  -1.064e+01  1.201e+00  -8.858  < 2e-16 ***
## seller_typeIndividual       -7.892e+00  1.299e+00  -6.074 1.33e-09 ***
## seller_typeTrustmark Dealer -4.949e+00  5.549e+00  -0.892 0.372486    
## transmissionManual          -1.221e+01  1.569e+00  -7.782 8.51e-15 ***
## ownerFourth & Above Owner   -2.216e+00  2.671e+00  -0.830 0.406730    
## ownerSecond Owner           -5.364e+00  9.265e-01  -5.790 7.47e-09 ***
## ownerTest Drive Car          2.421e+02  1.222e+01  19.817  < 2e-16 ***
## ownerThird Owner            -3.792e+00  1.542e+00  -2.459 0.013979 *  
## fuel_efficiency             -6.184e-01  1.814e-01  -3.408 0.000658 ***
## engine                       9.462e-03  1.801e-03   5.254 1.55e-07 ***
## max_power                    4.977e-01  2.041e-02  24.379  < 2e-16 ***
## seats                       -8.198e-02  5.853e-01  -0.140 0.888617    
## make_categoryLuxury          1.195e+02  4.550e+00  26.274  < 2e-16 ***
## make_categoryMidrange        1.053e+01  3.093e+00   3.403 0.000672 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.97 on 5272 degrees of freedom
## Multiple R-squared:  0.7107, Adjusted R-squared:  0.7098 
## F-statistic: 809.5 on 16 and 5272 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(full_model))
## [1] 17
#Leave one out cross validation RMSE 
sqrt(mean((resid(full_model) / (1 - hatvalues(full_model))) ^ 2))
## [1] 27.37401
# Diagnostic plots
par(mfrow = c(1, 2))
plot(full_model,which=c(1,2))

# Diagnostic tests
bp_test <- bptest(full_model)
print(bp_test)
## 
##  studentized Breusch-Pagan test
## 
## data:  full_model
## BP = 1499.3, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(full_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.7217, p-value < 2.2e-16
# Multicollinearity test
vif(full_model)
##                      GVIF Df GVIF^(1/(2*Df))
## year             2.190700  1        1.480101
## km_driven_in_10k 1.634226  1        1.278369
## fuel             2.588377  1        1.608843
## seller_type      1.112922  2        1.027108
## transmission     1.446943  1        1.202889
## owner            1.412292  4        1.044096
## fuel_efficiency  3.683989  1        1.919372
## engine           5.759182  1        2.399830
## max_power        3.026818  1        1.739775
## seats            2.419890  1        1.555599
## make_category    1.562362  2        1.118009

The “full” linear model has a high \(R^2\) value of 0.7212, indicating that 72.12% of the variance in selling_price_in_10k is explained by the predictors.

Most predictors are statistically significant (p-value < 0.05), suggesting they contribute meaningfully to the model. Exceptions are:

  • seller_typeTrustmark Dealer (p = 0.33)
  • ownerFourth & Above Owner (p = 0.14)
  • seats (p = 0.48)

From this initial model, it is apparent:

  • year is positively correlated with price (newer cars sell for higher prices)
  • engine, max_power, make_category:Luxury are positively correlated with selling prices (more powerful vehicles and luxury vehicles sell for more)
  • km_driven_in_10k, fuelPetrol, and fuel_efficiency are negatively correlated with selling price: higher mileage vehicles and petrol vehicles are associated with lower selling prices

Diagnostic testing for this model:

  • The Breusch-Pagan test yielded a very high test statistic of 1538.2, and a p-value less than \(2.2*10^{-16}\), which strongly indicates heteroscedasticity. This is reflected by the “fat tails” in the QQ plot.

  • The Shapiro-Wilk test yielded a low p-value less than \(2.2*10^{-16}\), which strongly rejects the null hypothesis that the residuals are normally distributed.

For this model, VIF analysis shows:

  • All VIFs are below 5, indicating no severe multicollinearity among predictors. However:
    • engine (VIF = 2.36) and fuel_efficiency (VIF = 1.80) have relatively higher multicollinearity (but they still within acceptable ranges).

Logarithmic transformation of response variable

  • From the previous Fitted vs Residual plot we can see there is a pattern in the residuals and the constant assumption is violated.
  • Logarithmic transformation can make the positively skewed distribution of selling price to be normally distributed and makes it more suitable for model building and for stabilizing the variance.
train_data$selling_price_in_10k = log(train_data$selling_price_in_10k)
test_data$selling_price_in_10k = log(test_data$selling_price_in_10k)

Additive “full” linear model (using all available predictors except make) | with BIC

# Definition of "full" linear model with BIC-based stepwise model selection starting from full model (both directions)
n= nrow(train_data)
bic_full_add_model = step(lm(selling_price_in_10k ~ .-make, data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_full_add_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ (make + year + km_driven_in_10k + 
##     fuel + seller_type + transmission + owner + fuel_efficiency + 
##     engine + max_power + seats + make_category) - make, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.68520 -0.16717  0.02508  0.19042  1.42921 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.217e+02  3.054e+00 -72.603  < 2e-16 ***
## year                         1.112e-01  1.530e-03  72.642  < 2e-16 ***
## km_driven_in_10k            -3.850e-03  1.033e-03  -3.728 0.000195 ***
## fuelPetrol                  -1.369e-01  1.296e-02 -10.569  < 2e-16 ***
## seller_typeIndividual       -7.010e-02  1.401e-02  -5.003 5.82e-07 ***
## seller_typeTrustmark Dealer  8.294e-02  5.984e-02   1.386 0.165779    
## transmissionManual          -9.709e-02  1.692e-02  -5.737 1.02e-08 ***
## ownerFourth & Above Owner   -1.212e-01  2.880e-02  -4.209 2.61e-05 ***
## ownerSecond Owner           -7.404e-02  9.991e-03  -7.411 1.45e-13 ***
## ownerTest Drive Car          6.114e-01  1.317e-01   4.641 3.55e-06 ***
## ownerThird Owner            -1.087e-01  1.663e-02  -6.536 6.90e-11 ***
## fuel_efficiency              1.214e-02  1.956e-03   6.207 5.83e-10 ***
## engine                       2.566e-04  1.942e-05  13.213  < 2e-16 ***
## max_power                    8.500e-03  2.201e-04  38.616  < 2e-16 ***
## seats                        4.647e-02  6.311e-03   7.362 2.09e-13 ***
## make_categoryLuxury          5.779e-01  4.906e-02  11.778  < 2e-16 ***
## make_categoryMidrange        2.194e-01  3.336e-02   6.576 5.28e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2909 on 5272 degrees of freedom
## Multiple R-squared:  0.8474, Adjusted R-squared:  0.8469 
## F-statistic:  1830 on 16 and 5272 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_full_add_model))
## [1] 17
#Leave one out cross validation RMSE 
sqrt(mean((resid(bic_full_add_model) / (1 - hatvalues(bic_full_add_model))) ^ 2))
## [1] 0.2914511
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_full_add_model,which=c(1,2))

# Diagnostic tests
bp_test = bptest(bic_full_add_model)
print(bp_test)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_full_add_model
## BP = 372.41, df = 16, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_full_add_model), size = 5000)
shapiro_test = shapiro.test(residual_subset)
print(shapiro_test)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.98094, p-value < 2.2e-16
# Multicollinearity test
vif(bic_full_add_model)
##                      GVIF Df GVIF^(1/(2*Df))
## year             2.190700  1        1.480101
## km_driven_in_10k 1.634226  1        1.278369
## fuel             2.588377  1        1.608843
## seller_type      1.112922  2        1.027108
## transmission     1.446943  1        1.202889
## owner            1.412292  4        1.044096
## fuel_efficiency  3.683989  1        1.919372
## engine           5.759182  1        2.399830
## max_power        3.026818  1        1.739775
## seats            2.419890  1        1.555599
## make_category    1.562362  2        1.118009
Adjusted R^2 Number of
parameters
Breusch-Pagan test Shapiro-Wilk test
(with 5k sample residuals)
VIF>5 Non significant
predictors(pvalue > 0.05)
LOOCV
RMSE
Full additive model with BIC
(Excluding “make” predictor)
0.8469423 17 Constant variance assumption
violated with BP = 372.41
Normality assumption violated
with W = 0.98173
None seller_typeTrustmark Dealer 0.2914511
  • The “full” linear model has a high \(R^2\) value of 0.8469423, indicating that 84.69% of the variance in selling_price_in_10k is explained by the predictors.

  • Most predictors are statistically significant (p-value < 0.05), suggesting they contribute meaningfully to the model. Except seller_typeTrustmark Dealer (p = 0.165779)

  • Diagnostic testing for this model:

    • The Breusch-Pagan test yielded a very high test statistic of 372.41 and a p-value less than \(2.2*10^{-16}\) which rejects the null hypothesis that constant variance exists in residuals

    • The Shapiro-Wilk test with 5000 sample residuals yielded a low p-value less than \(2.2*10^{-16}\), which rejects the null hypothesis that the residuals are normally distributed.

“Small additive model” with BIC-based stepwise feature selection

bic_small_add_model = step(lm(selling_price_in_10k ~ year + make_category + max_power + transmission + fuel + seller_type + owner +engine, data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_small_add_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power + 
##     transmission + fuel + seller_type + owner + engine, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.66616 -0.16658  0.02659  0.19106  1.39914 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.362e+02  2.518e+00 -93.801  < 2e-16 ***
## year                         1.186e-01  1.248e-03  95.034  < 2e-16 ***
## make_categoryLuxury          5.800e-01  4.917e-02  11.796  < 2e-16 ***
## make_categoryMidrange        2.257e-01  3.358e-02   6.723 1.97e-11 ***
## max_power                    7.966e-03  2.098e-04  37.967  < 2e-16 ***
## transmissionManual          -8.779e-02  1.696e-02  -5.176 2.36e-07 ***
## fuelPetrol                  -1.740e-01  9.803e-03 -17.752  < 2e-16 ***
## seller_typeIndividual       -7.327e-02  1.407e-02  -5.209 1.97e-07 ***
## seller_typeTrustmark Dealer  7.134e-02  6.028e-02   1.183    0.237    
## ownerFourth & Above Owner   -1.188e-01  2.899e-02  -4.098 4.23e-05 ***
## ownerSecond Owner           -7.980e-02  1.002e-02  -7.962 2.05e-15 ***
## ownerTest Drive Car          5.996e-01  1.327e-01   4.518 6.39e-06 ***
## ownerThird Owner            -1.156e-01  1.671e-02  -6.920 5.07e-12 ***
## engine                       2.549e-04  1.290e-05  19.762  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2931 on 5275 degrees of freedom
## Multiple R-squared:  0.845,  Adjusted R-squared:  0.8446 
## F-statistic:  2211 on 13 and 5275 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_small_add_model))
## [1] 14
#Leave one out cross validation RMSE 
sqrt(mean((resid(bic_small_add_model) / (1 - hatvalues(bic_small_add_model))) ^ 2))
## [1] 0.2935451
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_small_add_model,which=c(1,2))

# Diagnostic tests
bptest(bic_small_add_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_small_add_model
## BP = 356.78, df = 13, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_small_add_model), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.98009, p-value < 2.2e-16
# Multicollinearity test
vif(bic_small_add_model)
##                   GVIF Df GVIF^(1/(2*Df))
## year          1.434890  1        1.197869
## make_category 1.537469  2        1.113529
## max_power     2.707912  1        1.645573
## transmission  1.431656  1        1.196518
## fuel          1.459458  1        1.208080
## seller_type   1.103696  2        1.024973
## owner         1.393515  4        1.042351
## engine        2.501522  1        1.581620

ANOVA of Small and Full additive models

anova(bic_small_add_model, bic_full_add_model)
## Analysis of Variance Table
## 
## Model 1: selling_price_in_10k ~ year + make_category + max_power + transmission + 
##     fuel + seller_type + owner + engine
## Model 2: selling_price_in_10k ~ (make + year + km_driven_in_10k + fuel + 
##     seller_type + transmission + owner + fuel_efficiency + engine + 
##     max_power + seats + make_category) - make
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5275 453.14                                  
## 2   5272 446.00  3    7.1387 28.128 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • From the above Analysis of Variance(ANOVA) result we can say that Full additive model is statistically better than small additive model, based on the high F-test low p_value and lower RSS.
  • Full additive model has lower RSS (446.00) than small additive model (453.14), indicating that full additive model fits the data better.
  • Low p_value indicates that the small additive model is NOT sufficient in explaining the variance of the response variable.
  • Hence we reject the null hypothesis that H0: small additive model is sufficient in explaining the variance of the response variable.
  • We also understand that full model is more complex with 17 parameters is harder to interpret than the small model.
  • Hence we can select the small model since the difference in RSS is very small (7.14) and it is comparitively easier to interpret

Influential points

indicies_to_exclude = unname(which(cooks.distance(bic_small_add_model) > 4/length(cooks.distance(bic_small_add_model))))
train_data_filtered = train_data[-indicies_to_exclude,]
#Number of excluded observations
nrow(train_data) - nrow(train_data_filtered)
## [1] 283

Refitting the BIC Small additive model without influential points

n= nrow(train_data_filtered)
refitted_bic_small_add_model = lm(formula = formula(bic_small_add_model), data = train_data_filtered)

# Model summary
summary(refitted_bic_small_add_model)
## 
## Call:
## lm(formula = formula(bic_small_add_model), data = train_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.91665 -0.16073  0.02035  0.17474  0.91654 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 -2.398e+02  2.354e+00 -101.878  < 2e-16 ***
## year                         1.204e-01  1.167e-03  103.166  < 2e-16 ***
## make_categoryLuxury          5.844e-01  5.026e-02   11.627  < 2e-16 ***
## make_categoryMidrange        2.545e-01  3.380e-02    7.529 6.03e-14 ***
## max_power                    8.331e-03  1.909e-04   43.652  < 2e-16 ***
## transmissionManual          -7.914e-02  1.522e-02   -5.202 2.05e-07 ***
## fuelPetrol                  -1.661e-01  8.680e-03  -19.136  < 2e-16 ***
## seller_typeIndividual       -8.145e-02  1.250e-02   -6.514 8.02e-11 ***
## seller_typeTrustmark Dealer  3.674e-02  6.880e-02    0.534   0.5934    
## ownerFourth & Above Owner   -7.936e-02  3.096e-02   -2.563   0.0104 *  
## ownerSecond Owner           -6.684e-02  8.897e-03   -7.512 6.85e-14 ***
## ownerThird Owner            -9.338e-02  1.554e-02   -6.010 1.99e-09 ***
## engine                       2.653e-04  1.172e-05   22.648  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.253 on 4993 degrees of freedom
## Multiple R-squared:  0.8685, Adjusted R-squared:  0.8682 
## F-statistic:  2747 on 12 and 4993 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(refitted_bic_small_add_model))
## [1] 13
#Leave one out cross validation RMSE 
sqrt(mean((resid(refitted_bic_small_add_model) / (1 - hatvalues(refitted_bic_small_add_model))) ^ 2))
## [1] 0.2531199
# Diagnostic plots
par(mfrow = c(1, 2))
plot(refitted_bic_small_add_model,which=c(1,2))

# Diagnostic tests
bptest(refitted_bic_small_add_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  refitted_bic_small_add_model
## BP = 251.66, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(refitted_bic_small_add_model), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.99136, p-value < 2.2e-16
# Multicollinearity test
vif(refitted_bic_small_add_model)
##                   GVIF Df GVIF^(1/(2*Df))
## year          1.427903  1        1.194949
## make_category 1.372681  2        1.082412
## max_power     2.453995  1        1.566523
## transmission  1.319415  1        1.148658
## fuel          1.455345  1        1.206377
## seller_type   1.088027  2        1.021315
## owner         1.356970  3        1.052192
## engine        2.415179  1        1.554085

“Interaction model” with BIC-based stepwise feature selection

  • From the above results we can see that the constant variance assumption is violated for all the above models.
  • Hence there could be some more possibilities to improve the model either through interaction terms or through predictor transformations.
  • In the following model the interactions terms are used based the results of data analysis
bic_int_model = step(lm(selling_price_in_10k ~ year + make_category + max_power*fuel + transmission*seller_type + owner + engine, data = train_data),direction="both",k=log(n),trace=0)

# Model summary
summary(bic_int_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ year + make_category + max_power + 
##     fuel + transmission + seller_type + owner + engine + max_power:fuel, 
##     data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.64759 -0.16287  0.02749  0.18978  1.40197 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.354e+02  2.522e+00 -93.335  < 2e-16 ***
## year                         1.182e-01  1.250e-03  94.613  < 2e-16 ***
## make_categoryLuxury          6.041e-01  4.952e-02  12.198  < 2e-16 ***
## make_categoryMidrange        2.229e-01  3.354e-02   6.646 3.32e-11 ***
## max_power                    7.604e-03  2.305e-04  32.985  < 2e-16 ***
## fuelPetrol                  -2.697e-01  2.730e-02  -9.882  < 2e-16 ***
## transmissionManual          -8.613e-02  1.695e-02  -5.082 3.86e-07 ***
## seller_typeIndividual       -7.119e-02  1.406e-02  -5.063 4.27e-07 ***
## seller_typeTrustmark Dealer  6.115e-02  6.026e-02   1.015 0.310308    
## ownerFourth & Above Owner   -1.169e-01  2.896e-02  -4.037 5.49e-05 ***
## ownerSecond Owner           -8.069e-02  1.001e-02  -8.059 9.42e-16 ***
## ownerTest Drive Car          5.309e-01  1.338e-01   3.968 7.36e-05 ***
## ownerThird Owner            -1.177e-01  1.670e-02  -7.048 2.05e-12 ***
## engine                       2.503e-04  1.294e-05  19.345  < 2e-16 ***
## max_power:fuelPetrol         1.125e-03  2.996e-04   3.756 0.000175 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2927 on 5274 degrees of freedom
## Multiple R-squared:  0.8454, Adjusted R-squared:  0.845 
## F-statistic:  2060 on 14 and 5274 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(bic_int_model))
## [1] 15
#Leave one out cross validation RMSE 
sqrt(mean((resid(bic_int_model) / (1 - hatvalues(bic_int_model))) ^ 2))
## [1] 0.2932303
# Diagnostic plots
par(mfrow = c(1, 2))
plot(bic_int_model,which=c(1,2))

# Diagnostic tests
bptest(bic_int_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_int_model
## BP = 378.51, df = 14, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(bic_int_model), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.97766, p-value < 2.2e-16
# Multicollinearity test
vif(bic_int_model)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
##                     GVIF Df GVIF^(1/(2*Df))
## year            1.442807  1        1.201169
## make_category   1.602363  2        1.125098
## max_power       3.277862  1        1.810487
## fuel           11.341871  1        3.367769
## transmission    1.432631  1        1.196926
## seller_type     1.108617  2        1.026113
## owner           1.422868  4        1.045070
## engine          2.523906  1        1.588680
## max_power:fuel  9.698828  1        3.114294

ANOVA of Small additive model and Interaction model

  • We cannot compare the refitted_bic_small_add_model with interaction model because the data sets will differ and the anova test will give an error.
anova(bic_small_add_model, bic_int_model)
## Analysis of Variance Table
## 
## Model 1: selling_price_in_10k ~ year + make_category + max_power + transmission + 
##     fuel + seller_type + owner + engine
## Model 2: selling_price_in_10k ~ year + make_category + max_power + fuel + 
##     transmission + seller_type + owner + engine + max_power:fuel
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   5275 453.14                                  
## 2   5274 451.93  1    1.2087 14.106 0.0001747 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Statistically low p_value suggests we can reject the null hypothesis that bic_small_add_model is sufficient in explaining the response variable and the interaction model is better.
  • Again, here the difference in the RSS of the model is very low.(1.21)
  • Hence we can go ahead with bic_small_add_model or in turn use refitted_bic_small_add_model, which has comparatively low test statistic of BP = 251.66 when compared to the interaction model which has a BP=378.51. (Higher the value of test statistic, lower the p_value)

Weighted least square approach

  • In all of the above models we can see that the constant variance assumption is violated.
  • To resolve this we can try weighted linear regression approach
  • In weighted linear regression, observations with higher weights contribute more to the estimation of the regression coefficients, while observations with lower weights have less influence.
  • This approach is often used when there is non-constant variance in the residuals or when certain observations are more reliable or important than others.
fitted = fitted(refitted_bic_small_add_model)
weights = 1 / (fitted^2)
weighted_model = lm(formula=formula(refitted_bic_small_add_model),data=train_data_filtered,weights=weights)

# Model summary
summary(weighted_model)
## 
## Call:
## lm(formula = formula(refitted_bic_small_add_model), data = train_data_filtered, 
##     weights = weights)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31567 -0.04284  0.00591  0.04728  0.33094 
## 
## Coefficients:
##                               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)                 -2.312e+02  1.993e+00 -115.991  < 2e-16 ***
## year                         1.161e-01  9.890e-04  117.377  < 2e-16 ***
## make_categoryLuxury          5.689e-01  6.183e-02    9.201  < 2e-16 ***
## make_categoryMidrange        2.486e-01  3.143e-02    7.910 3.13e-15 ***
## max_power                    8.558e-03  2.143e-04   39.939  < 2e-16 ***
## transmissionManual          -7.947e-02  1.820e-02   -4.367 1.29e-05 ***
## fuelPetrol                  -1.530e-01  9.200e-03  -16.631  < 2e-16 ***
## seller_typeIndividual       -8.964e-02  1.429e-02   -6.272 3.87e-10 ***
## seller_typeTrustmark Dealer  2.574e-02  8.374e-02    0.307    0.759    
## ownerFourth & Above Owner   -1.082e-01  2.442e-02   -4.430 9.61e-06 ***
## ownerSecond Owner           -7.745e-02  8.799e-03   -8.803  < 2e-16 ***
## ownerThird Owner            -9.627e-02  1.388e-02   -6.936 4.54e-12 ***
## engine                       2.617e-04  1.297e-05   20.175  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07421 on 4993 degrees of freedom
## Multiple R-squared:  0.8962, Adjusted R-squared:  0.8959 
## F-statistic:  3592 on 12 and 4993 DF,  p-value: < 2.2e-16
#Number of parameters 
length(coef(weighted_model))
## [1] 13
#Leave one out cross validation RMSE 
sqrt(mean((resid(weighted_model) / (1 - hatvalues(weighted_model))) ^ 2))
## [1] 0.2536116
# Diagnostic plots
#par(mfrow = c(1, 2))
plot(weighted_model,which=c(1,2))

# Diagnostic tests
bptest(weighted_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  weighted_model
## BP = 135.18, df = 12, p-value < 2.2e-16
# Subset residuals for Shapiro-Wilk test (up to 5000 samples)
residual_subset = sample(residuals(weighted_model), size = 5000)
shapiro.test(residual_subset)
## 
##  Shapiro-Wilk normality test
## 
## data:  residual_subset
## W = 0.99066, p-value < 2.2e-16
# Multicollinearity test
vif(weighted_model)
##                   GVIF Df GVIF^(1/(2*Df))
## year          1.549228  1        1.244680
## make_category 1.226144  2        1.052290
## max_power     2.461797  1        1.569012
## transmission  1.225061  1        1.106825
## fuel          1.577181  1        1.255859
## seller_type   1.072007  2        1.017535
## owner         1.370854  3        1.053979
## engine        2.552408  1        1.597626

Results

Discussion

Appendix